Overview

The purpose of this notebook was more about learning and experimenting with time series data, using a traditional machine learning approach to make future predictions. There were some unique elements to this data that made it a bit challenging to work with. The initial challenge was to develop hierarchical modeling solutions such that one set of predictions could be aggregated up to get the overall total sales at the store level (or the reverse: bottom-up vs. top-down approaches). For this exercise, I just wanted to refresh myself on doing time series predictions with an LGBM-type model, mostly to further ingrain good CV strategies for handling these kinds of everyday problems. I ended up generating predictions for just a single store and department: Food sales for a single store in CA. I think this could pretty easily be extended to the category-store level, which would require training 30 different models (10 stores x 3 departments per store) and making 30 different sets of predictions. The departments are: Food, Hobbies, and Household items. More details about this competition can be found here: https://www.kaggle.com/c/m5-forecasting-accuracy/overview

This competition had two different pieces: one where contestants made predictions, and another where they estimated the uncertainty around those predictions using the specific metric the competition asked contestants to compute. That metric seemed pretty complex and hard to understand, so I did not bother with it here and stuck with the more common and easier-to-grasp RMSE (root mean squared error). I think there are reasons they chose that specific error metric, but my understanding is that those reasons are also based on hierarchical modeling concepts, which I didn't really apply here.

Modeling Strategy

Features

For the modeling strategy, I kept it relatively simple. I used a few different lag variables. One interesting note on the lags: a 28-day sales lag had a pretty strong correlation with current-day sales, so that was used as the main sales lag variable. We can't use more recent lags because we won't have that information when we go to make predictions. At least, not without generating dynamic features, which I did not do here. One could envision that for each day out you are forecasting from the current day, you generate a new set of lag features so you are always using the most recently available data, rather than a static lag equal to the maximum prediction horizon. But that seems overly complex and I'm not 100% clear on how it would work in practice. So for each day in our 28-day prediction window, we use the lag from 28 days prior to that day, regardless of which day in the window it is. Other features used in the model were a holiday indicator (day of, day before, and day after) and some rolling means of sales.
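The lag and rolling-mean features described above can be sketched with pandas like this. The column names and window sizes here are assumptions for illustration, and the sales series is synthetic:

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales frame: one row per date for a single
# store/department; the 'sales' column name is an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2011-01-29", periods=120, freq="D"),
    "sales": rng.integers(2500, 3500, size=120),
})

# 28-day lag: the most recent sales value still known when forecasting
# a full 28-day horizon with static (non-dynamic) features.
df["lag_28"] = df["sales"].shift(28)

# Rolling means computed on the lagged series, so they also only use
# information available at prediction time.
df["roll_mean_7"] = df["lag_28"].rolling(7).mean()
df["roll_mean_28"] = df["lag_28"].rolling(28).mean()
```

Computing the rolling means on `lag_28` rather than on `sales` directly is what keeps the features leak-free for a 28-day-ahead forecast.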

Another way of doing this type of modeling would be to modify the outcome variable (y) rather than the independent or input variables (X). I did not use this approach here, but it is an option. The y values would simply shift back 28 days, so that for a given date t you are using the outcome (here, sales) from 28 days after t and trying to predict that.
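The target-shifting alternative is just a negative `shift` on y. A minimal sketch on synthetic data (column names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "date": pd.date_range("2011-01-29", periods=60, freq="D"),
    "sales": rng.integers(2500, 3500, size=60),
})

# Shift the target back 28 days: at time t the model is trained to
# predict the sales observed at t + 28, using only features known at t.
df["target"] = df["sales"].shift(-28)

# The final 28 rows have no future target yet and are dropped for training.
train = df.dropna(subset=["target"])
```

With this framing, the features at each row can stay as simple same-day values, since the 28-day offset now lives in y instead of in the lags.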

Cross-validation

My cross-validation strategy was also relatively straightforward, and I thought it provided a pretty realistic approximation of how the model would be used in an actual business environment. I attempted two different CV strategies just to get the practice. The initial strategy was a rolling window: use some fixed time period for training, then make predictions on the next 28 days. Each fold consisted of 70-80 consecutive days of training data, followed by 28 consecutive days to test the model on. After making predictions on the test set in the first fold, the process moves to the next fold and does the same thing, each time shifting both the training window and the prediction window forward in time. The training window was a constant size, and the prediction window was always the 28 days immediately following the end of the training set.
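The rolling-window fold logic above can be sketched as a small generator. The default sizes mirror the ~70-80 day training windows and 28-day test windows described, but the exact values and function name are assumptions:

```python
import numpy as np

def rolling_window_splits(n_samples, train_size=75, horizon=28):
    """Yield (train_idx, test_idx) pairs of consecutive row positions.

    Each fold trains on `train_size` consecutive days and tests on the
    `horizon` days immediately after; both windows then slide forward
    by `horizon` days for the next fold.
    """
    start = 0
    while start + train_size + horizon <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + horizon)
        yield train_idx, test_idx
        start += horizon

folds = list(rolling_window_splits(200))
```

Because the folds are plain index arrays, they can be fed to any model loop, or directly to scikit-learn utilities that accept an iterable of splits.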

I also attempted an expanding window CV strategy to see how it impacted model error. With this strategy, I started with a small training set and gradually expanded it moving forward in time. What I discovered was that the model error tended to decrease as the training set size increased, at least once we got past a certain point, which was interesting. I made a few plots further down in the notebook to illustrate this. I should also point out the tradeoffs with this approach: since the window keeps expanding, it creates many more folds to iterate through and requires a lot more resources to run. The rolling window took less than a minute to run and produce error metrics for all folds, but the expanding window strategy took about 10 minutes, and this was on a relatively small sample of the data, only ~1,900 rows total. With hundreds of thousands or millions of rows, you'd almost certainly need to run that type of CV in AWS SageMaker or an equivalent cloud computing resource, assuming you were using nearly all of the data for training.
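The expanding-window variant only changes how the training indices grow: the window always starts at day 0 and extends each fold. A minimal sketch (parameter values and the function name are illustrative assumptions):

```python
import numpy as np

def expanding_window_splits(n_samples, min_train_size=100, horizon=28, step=28):
    """Yield (train_idx, test_idx) where the training window grows each fold.

    Training always starts at index 0 and grows by `step` days per fold;
    the test window is the `horizon` days right after the training data.
    """
    train_end = min_train_size
    while train_end + horizon <= n_samples:
        yield np.arange(0, train_end), np.arange(train_end, train_end + horizon)
        train_end += step
```

The growing training sets are also why this strategy is so much slower: later folds refit the model on nearly the whole history.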

Results/Scoring

The model seemed to perform relatively well predicting sales at the store-category level, at least for the store I chose. I used RMSE as the main scoring metric due to its interpretability and achieved an average RMSE of ~300-350 predicting food sales at the CA_1 store; average daily sales in each prediction period were ~3,000 items. This was lower than the RMSE of the baseline/naive model, which just used the previous 28 days as the predictions for the next 28 days and scored around 420. I'm not sure where an RMSE of 350 would rank among all entrants in this competition (probably not super high, but probably not last either). I went through some notebooks from competition entrants and saw some extremely low RMSEs, some of which seemed almost unbelievable (e.g., RMSEs < 50!). Though, looking through the code, it wasn't always clear how some of these people were validating their model predictions and errors, since not everyone posted all of their code.
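The naive baseline comparison above is easy to reproduce: repeat the last observed 28 days as the forecast and score it with RMSE. A sketch on synthetic data (the real series isn't reproduced here):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Seasonal-naive baseline: the previous 28 days become the prediction
# for the next 28 days.
rng = np.random.default_rng(2)
sales = rng.normal(3000, 400, size=56)   # synthetic stand-in series
baseline_pred = sales[:28]               # last observed 28 days
actual = sales[28:]                      # the 28 days being forecast
baseline_rmse = rmse(actual, baseline_pred)
```

Any candidate model should at minimum beat `baseline_rmse` on the same test window before it's worth deploying.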

So the calendar goes from 1/29/2011 all the way through the final day (day 1969), or 2016-06-19. We'll need to do some work on this to make it easier to include in the final model. We want to give the model the best chance of successfully predicting a 28-day forecast, so we need to align this data with the training data somehow.

It looks like we know whether there's a calendar event for any given day.

Looks like there are periods where sales really plummet in December, likely the Christmas holiday. Good to keep in mind for modeling!

Also, there's a pretty big change in volume between 2015 and 2016. What happened there? New products introduced?

So this appears to show that there were no new products offered for that store during that time period. Something to keep in mind though.

There definitely appears to be some seasonality to food sales, with sales peaking in April-July and bottoming out in December due to the holidays.

The interesting thing about this plot is that the 28-day lag has a higher positive correlation with our outcome (sales) than any other lag variable, even the 1-day lag. The next closest is the 7-day lag.

Some Modeling Prep/Feature Engineering

Seems like the event types don't agree, so let's just combine these into a single 'holiday' indicator for a given date. There are a few categories, and I think coding the new 'holiday' field as 1 if it is a national holiday is probably okay. It might end up obscuring some info about whether some holidays have more impact than others, but let's see how it goes.
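A minimal sketch of that coding, including the day-before/day-after flags mentioned earlier. The column names loosely follow the competition's calendar file, but the rows here are made up:

```python
import pandas as pd

# Hypothetical slice of the calendar data; only one national holiday.
cal = pd.DataFrame({
    "date": pd.date_range("2016-07-01", periods=7, freq="D"),
    "event_type_1": [None, None, None, "National", None, None, None],
})

# 1 if the day is a national holiday, 0 otherwise.
cal["holiday"] = (cal["event_type_1"] == "National").astype(int)

# Flags for the day before and the day after a holiday.
cal["holiday_m1"] = cal["holiday"].shift(-1, fill_value=0)  # day before
cal["holiday_p1"] = cal["holiday"].shift(1, fill_value=0)   # day after
```

Collapsing all event types into one binary flag trades some information for simplicity, as noted above; separate per-type dummies would be the natural next step if the single flag underperforms.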

Double check the coding looks as expected

Looks good!

Now, just keep a few of these fields to join to our lag dataframe

Now let's start doing some modeling! I'm going to leave these features in and see how the model does. There's definitely room to create more features, though, like moving averages and such.

Note

I wanted some additional practice creating a different cross-validation strategy as well. Here, instead of the rolling window, I'm using an expanding window. The logic is somewhat similar to the rolling window above, but a few changes were needed. Both of these functions could be pretty useful whenever doing cross-validation on time series models, particularly with machine learning models.

This is interesting, as it seems to show an inflection point with regard to the training window size. Right around 1,100 training samples seems to be the inflection point, where the mean RMSE before and after (< 1,100 vs. > 1,100) would be quite different. It also seems to illustrate that as we give the model more historical data to learn from, it tends to do better on future predictions.

If we want to do a grid search, the code below allows us to do that to find the best model params, and we make sure to feed the CV splits from above to GridSearchCV as well.
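The key detail is that GridSearchCV's `cv` argument accepts an iterable of precomputed (train_idx, test_idx) pairs, so the time-series folds plug straight in. A self-contained sketch, with a scikit-learn regressor standing in for LGBMRegressor and made-up data and parameter grid values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the feature matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

def rolling_splits(n, train_size=150, horizon=28):
    """Rolling-window (train_idx, test_idx) folds, as used earlier."""
    start = 0
    while start + train_size + horizon <= n:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + horizon))
        start += horizon

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=list(rolling_splits(len(X))),          # time-ordered folds, no shuffling
    scoring="neg_root_mean_squared_error",    # negated so higher is better
)
search.fit(X, y)
```

Passing explicit splits this way avoids GridSearchCV's default KFold behavior, which would shuffle future observations into the training folds and leak information.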

So right now, the model does beat the baseline model where we just used the previous 28 days' values as the next 28 days of future sales. There are still a couple things we could modify in the LGBM model to improve performance, though. I didn't end up running grid search with the expanding window folds because I think it would take too much memory on my computer, but it would definitely be worth doing with more compute resources; it might further improve the model's performance.